Comparative Network Analysis of CPTAC Proteomics Data in Lung Squamous Cell Carcinoma

Author

Sofia Guglielmini

Background

Proteomic measurements provide insight into cellular functions by quantifying the abundance of proteins, which in turn mediate biological processes. Proteins often interact as parts of complex networks, regulating their behavior in a coordinated way. In diseased conditions, dysregulation often manifests as an alteration of the coordination system among proteins, in addition to absolute changes in protein abundance (Zhang et al. 2018). Network models offer an effective framework for studying such coordinated behavior by estimating the conditional dependence structure among proteins. Here, we perform a comparative network analysis of lung squamous cell carcinoma (LSCC) proteomics data from the CPTAC Pan-Cancer project. The analysis focuses on differences between normal and tumor samples, with emphasis on network topology, hub structure, edge-level changes, robustness, and biological interpretability.

Data

We analyze harmonized gene-level proteomics Pan-Cancer data generated by the National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium (CPTAC) and processed using the Baylor College of Medicine pipeline (see STAR methods, Li et al. 2023). Protein abundances are measured using LC-MS/MS-based mass spectrometry and aggregated to gene-level protein abundance, estimated by aggregating peptide intensities.

The analysis was restricted to the LSCC cohort and genes were restricted to the KEGG cancer-related pathways (hsa05200).

Proteomics measurements were furthermore filtered to keep genes with at most 20% missing values, while the remaining missing values were imputed using k-nearest neighbors (k = 10). Outlier genes (kurtosis > 10) and genes with zero variance were removed prior to network estimation. The data was scaled and centered prior to network inference.

After preprocessing, the dataset includes 207 samples and 304 genes. A summary of sample sizes by condition is provided below.

Dataset N Samples N Genes
0 All Samples 207 304
1 Normal 99 304
2 Tumor 108 304

Correlation Matrices

We start by displaying the empirical correlation matrix of the processed data by condition (normal vs tumor), shown below. We also display the difference between the two correlation matrices (tumor - normal).

Both correlation matrices are visually sparse, with a mix of positive and negative correlations. The difference matrix reveals both strengthened and weakened pairwise correlations in tumor samples, suggesting complex rewiring of protein interactions in cancer. No evident block structure is observed in either condition or in the difference matrix.

Network Analysis

Correlation matrices only summarize pairwise relationships and do not account for indirect effects. To capture direct protein-protein interactions, protein networks were inferred using sparse Gaussian graphical models (GGMs). Under this (Gaussian copula) framework, edges represent conditional dependencies between proteins, quantified via partial correlations derived from the estimated inverse covariance (precision) matrix. Due to the high-dimensional setting, lasso regularization was used to induce sparsity and permit estimation. The penalty parameter is manually tuned to \(2.5\sqrt{\log(p)/n}\), where \(p\) is the number of genes and \(n\) the number of samples in each condition (Friedman et al., 2008).

As proteomic measurements are aggregated at the gene level, throughout the analysis, network nodes represent genes, with protein-level measurements and annotations collapsed per gene.

Separate networks for normal samples and tumor samples are estimated, and shown below. Communities are detected using the Louvain method for community detection. Nodes are scaled by weighted degree, and colored by community membership. Edge widths are scaled by absolute partial correlation.

Excluding isolated nodes, in the normal network we identify 5 communities, while in the tumor network we identify 9 communities. The tumor network exhibits higher density and overall connectivity compared to the normal network, suggesting increased coordination among proteins in the cancer state. Weighted degree is defined as the sum of the absolute partial correlations corresponding to edges connected to each node, capturing both the number and strength of conditional associations.

Metric Normal Tumor
0 Proportion_Nonzero_Edges 0.025 0.040
1 Average_Node_Degree 3.859 6.099
2 Mean_Weighted_Degree 0.111 0.243

Centrality, Stability and Biological Annotation

Node centrality was quantified using weighted degree. However, sparse network estimation is sensitive to sampling variability. To assess the robustness of hub identification, weighted-degree stability was evaluated using subsampling. First, global networks were repeatedly estimated on random subsamples (80% of observations), and the mean weighted degree (MWD) and coefficient of variation (CV, i.e. ratio of standard deviation to mean) were computed for each gene. Hub genes were defined as those with high mean weighted degree. The CV was used to assess stability, with lower CV indicating more robust estimates. The top 10 hub genes in each condition are summarized below. The library mygene was used for annotation of the full gene names. The table below displays the top 10 hub genes ranked by weighted degree in the global network.

Hubs in Normal Network

Gene Mean_Weighted_Degree CV Gene_Name
0 GNAQ 1.862180 0.341138 G protein subunit alpha q
1 IL6ST 1.599017 0.204224 interleukin 6 cytokine family signal transducer
2 TGFBR2 1.579104 0.236497 transforming growth factor beta receptor 2
3 GNG11 1.576459 0.306760 G protein subunit gamma 11
4 PLCB1 1.569517 0.336525 phospholipase C beta 1
5 MAPK3 1.518865 0.263301 mitogen-activated protein kinase 3
6 LAMB2 1.467426 0.346506 laminin subunit beta 2
7 GNA11 1.373241 0.423603 G protein subunit alpha 11
8 APPL1 1.329453 0.246365 adaptor protein, phosphotyrosine interacting w...
9 COL4A1 1.309554 0.280163 collagen type IV alpha 1 chain

Hubs in Tumor Network

Gene Mean_Weighted_Degree CV Gene_Name
0 PRKCB 1.633496 0.162962 protein kinase C beta
1 MSH6 1.516476 0.314676 mutS homolog 6
2 MSH2 1.502637 0.330386 mutS homolog 2
3 JUP 1.398428 0.320394 junction plakoglobin
4 SPI1 1.312981 0.341773 Spi-1 proto-oncogene
5 PIK3CD 1.287251 0.193954 phosphatidylinositol-4,5-bisphosphate 3-kinase...
6 GNAI2 1.245188 0.275861 G protein subunit alpha i2
7 IL6ST 1.231422 0.237959 interleukin 6 cytokine family signal transducer
8 CDH1 1.217505 0.367396 cadherin 1
9 JAK1 1.216179 0.326077 Janus kinase 1

The main hubs in the normal network include genes encoding G-protein subunits (GNAQ, GNA11, GNG11) and downstream signaling components (PLCB1, MAPK3). Additional hubs include genes involved in cytokine and growth factor signaling (IL6ST, TGFBR2), as well as genes associated with extracellular matrix structure and cell–matrix interactions (LAMB2, COL4A1).

The hubs in the tumor network include genes associated with DNA mismatch repair (MSH2, MSH6), kinase-mediated signaling (PRKCB, PIK3CD), and transcriptional regulation (SPI1).

The functional descriptions of key hub genes were obtained from UniProt (The UniProt Consortium, 2025, www.uniprot.org).

We then compared weighted degree between normal and tumor networks to identify genes with altered centrality. The top 10 genes ranked by difference in average (bootstrapped) weighted degree between conditions are summarized below.

Hub Differences Between Conditions

Gene MWD_Normal CV_Normal MWD_Tumor CV_Tumor Difference Gene_Name
0 PRKCB 0.000 NaN 1.633 0.163 1.633 protein kinase C beta
1 MSH2 0.188 0.432 1.503 0.330 1.315 mutS homolog 2
2 SPI1 0.012 3.357 1.313 0.342 1.301 Spi-1 proto-oncogene
3 PIK3CD 0.000 NaN 1.287 0.194 1.287 phosphatidylinositol-4,5-bisphosphate 3-kinase...
4 GNAQ 1.862 0.341 0.625 0.479 -1.237 G protein subunit alpha q
5 CDH1 0.003 5.794 1.218 0.367 1.214 cadherin 1
6 MSH6 0.308 0.314 1.516 0.315 1.209 mutS homolog 6
7 JUP 0.211 0.551 1.398 0.320 1.188 junction plakoglobin
8 NFKB1 0.002 4.370 1.179 0.469 1.177 nuclear factor kappa B subunit 1
9 LAMB3 0.000 NaN 1.155 0.220 1.155 laminin subunit beta 3

Several genes (PRKCB, MSH2, MSH6, SPI1, PIK3CD, CDH1, JUP, NFKB1, LAMB3) change from isolated or weakly connected nodes in the normal network to highly connected nodes in the tumor network. In contrast, GNAQ shows a marked decrease in weighted degree in the tumor condition. These shifts indicate gene-specific redistribution of network connectivity between conditions.

Edge-level rewiring

To characterize network rewiring beyond node-level changes, we examined differences in partial correlations between normal and tumor networks. The top edges ranked by absolute change are visualized below. Edge changes indicate potential rewiring of protein coordination in the tumor condition.

All the largest changes involve increased partial correlations in tumor compared to normal, suggesting strengthened direct interactions in the cancer state.

Pathway Analysis

Since the samples are from LSCC patients, we consider the subset of genes which are in the KEGG pathway related to non-small cell lung cancer (hsa05223), which includes LSCC. For these genes, we observe a notable increase in centrality, total connectivity (sum of absolute partial correlations) and betweenness. Furthermore, we quantify how the pathway group is confirmed by the community detection algorithm in each condition, using the Herfindahl-Hirschman Index as a measure of concentration (Cruz et al., 2025; Herfindahl, 1951). The index is defined as the sum of squares of share of pathway genes in each cluster. The concentration is much higher in tumor compared to normal, indicating that the pathway genes are more concentrated within specific communities in the tumor network. This is consistent with the idea of increased coordination among cancer-related proteins in the tumor state.

Metric Normal Tumor Difference Percent_Change
0 Mean_Weighted_Degree 0.167 0.211 0.044 26.58
1 Total_Connectivity 50.791 64.292 13.501 26.58
2 Mean_Betweenness 0.002 0.005 0.003 204.70
3 Concentration_Index 0.009 0.102 0.093 1036.84

Limitations and Future Directions

A main limitation of this analysis (partially addressed by the resampling-based stability analysis) is the relatively small sample size, which may limit statistical power and robustness of network estimates. Joint estimation of the networks in the two conditions could be considered to borrow strength across conditions and improve estimation accuracy (Guo et al., 2011).

Future work could extend the current framework to larger cohorts, including pan-cancer analyses, to assess how the findings generalize across cancer types. Integrating multi-omics data (e.g. transcriptomics, genomics) would allow investigation of whether network alterations observed at the proteomic level are supported by other molecular levels.

Finally, statistical inference for network differences was not conducted in this study. Due to the data-driven model selection, standard inference methods do not apply. Data splitting, data thinning (Dharamshi et al., 2025) or post-selection inference (Guglielmini, Claeskens, 2025) can be used to address the issue, enabling valid uncertainty quantification for network properties and group differences.

References

Cruz J, Sun WY, Verbeke A, Hariharan IK. Single-cell transcriptomics of X-ray irradiated Drosophila wing discs reveals heterogeneity related to cell-cycle status and cell location. bioRxiv [Preprint]. 2025 Feb 11:2024.12.10.627868.

Dharamshi A, Neufeld A, Motwani K, Gao LL, Witten D, Bien J. Generalized data thinning using sufficient statistics. Journal of the American Statistical Association. 2025 Jan 2;120(549):511-23.

Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008 Jul;9(3):432-41.

Guglielmini S, Claeskens G. Asymptotic post-selection inference for regularized graphical models. Statistics and Computing. 2025 Apr;35(2):36.

Guo J, Levina E, Michailidis G, Zhu J. Joint estimation of multiple graphical models. Biometrika. 2011 Mar 1;98(1):1-5.

Herfindahl, Orris C. Concentration in the Steel Industry. University Microfilms, 1951.

The UniProt Consortium. UniProt: the universal protein knowledgebase in 2025. Nucleic acids research 53, no. D1 (2025): D609-D617.